The commented-out code is only there to knit faster! (With cache = TRUE the code is not re-run, but the markdown still has to compile the graphs, so it takes time on every knit…)

names(cred) <- tolower(names(cred)) # changing the name of the variable into lowercase
cred <- cred[,-1] # deleting the first useless column
sum(is.na(cred)) # checking if we have missing data

Introduction

To cover: - method used - main steps - classification task (goal) =/= prediction - [to be completed…]

–> Data exploration = first insight –> Modelling: train/test set + selection of the “best” tree (pruning) / SVM (tuning) / NNET (number of layers) / logistic regression / K-NN (choosing K) / etc. + detailed presentation of all of this –> Cross-validation with the best of each model to select the overall winner…

–> Description of the variables (that way it is easier to find our way around?!)

Exploratory analysis

Explanatory variables

Overview

Before beginning any kind of analysis, we have to understand the data we are working with.

data.frame(variable = names(cred),
           classe = sapply(cred, typeof),
           first_values = sapply(cred, 
                                 function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
  kable(caption="Overview of our data") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = F) %>%
  column_spec(1, width = "10em", border_right = T) %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "18em") %>%
  scroll_box(width = "65%", height = "250px")
Overview of our data
variable classe first_values
chk_acct integer 0, 1, 3, 0, 0, 3
duration integer 6, 48, 12, 42, 24, 36
history integer 4, 2, 4, 2, 3, 2
new_car integer 0, 0, 0, 0, 1, 0
used_car integer 0, 0, 0, 0, 0, 0
furniture integer 0, 0, 0, 1, 0, 0
radio.tv integer 1, 1, 0, 0, 0, 0
education integer 0, 0, 1, 0, 0, 1
retraining integer 0, 0, 0, 0, 0, 0
amount integer 1169, 5951, 2096, 7882, 4870, 9055
sav_acct integer 4, 0, 0, 0, 0, 4
employment integer 4, 2, 3, 3, 2, 2
install_rate integer 4, 2, 2, 2, 3, 2
male_div integer 0, 0, 0, 0, 0, 0
male_single integer 1, 0, 1, 1, 1, 1
male_mar_or_wid integer 0, 0, 0, 0, 0, 0
co.applicant integer 0, 0, 0, 0, 0, 0
guarantor integer 0, 0, 0, 1, 0, 0
present_resident integer 4, 2, 3, 4, 4, 4
real_estate integer 1, 1, 1, 0, 0, 0
prop_unkn_none integer 0, 0, 0, 0, 1, 1
age integer 67, 22, 49, 45, 53, 35
other_install integer 0, 0, 0, 0, 0, 0
rent integer 0, 0, 0, 0, 0, 0
own_res integer 1, 1, 1, 0, 0, 0
num_credits integer 2, 1, 1, 1, 2, 1
job integer 2, 2, 1, 2, 2, 1
num_dependents integer 1, 1, 2, 2, 2, 2
telephone integer 1, 0, 0, 0, 0, 1
foreign integer 0, 0, 0, 0, 0, 0
response integer 1, 0, 1, 1, 0, 1

 

As we can see, the data are consistent with the information that “the client” provided. Most of the variables are binary or categorical, while only a few are numerical.

Beyond the first values of our variables and their types, we also need to understand how they are distributed and how they relate to each other. A correlation plot shows the correlation between each pair of variables and, more importantly, their correlation with the response variable we are interested in.
We see for instance that variables like chk_acct, duration, history, sav_acct or rent are among the most correlated (positively or negatively) with our outcome variable, so they are likely to influence it in the models that we are going to fit. Others, like present_resident or retraining, should have a low impact.

# Caution: the correlation plot should only use the continuous variables.
# Categorical variables should be tested against the response with a
# chi-squared test of independence, and a Wilcoxon test can check whether
# a continuous variable differs between the two response groups.
plot_correlation(cred, type="all", title = "Correlation Graph")
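The two tests mentioned in the note above can be sketched in base R. This is only an illustration on synthetic data: the vectors response, chk_acct and duration below are simulated stand-ins that borrow the names of our variables, not the real credit data.

```r
# Synthetic stand-in for the credit data: every value below is made up
set.seed(42)
n <- 500
response <- rbinom(n, size = 1, prob = 0.7)

# Categorical predictor whose distribution depends on the response
chk_acct <- ifelse(
  response == 1,
  sample(0:3, n, replace = TRUE, prob = c(0.1, 0.2, 0.2, 0.5)),
  sample(0:3, n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
)

# Continuous predictor that is shifted between the two response groups
duration <- rnorm(n, mean = ifelse(response == 1, 18, 26), sd = 8)

# Chi-squared test of independence: categorical predictor vs response
chi <- chisq.test(table(chk_acct, response))

# Wilcoxon rank-sum test: does the continuous predictor differ between groups?
wil <- wilcox.test(duration ~ factor(response))

chi$p.value
wil$p.value
```

A small p-value in either test suggests that the predictor is associated with the response; on the real data the same two calls would be applied variable by variable.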

In addition, we can inspect the summary statistics of the different variables. The summary and the frequency table of history are presented below as an example:

Summary of variable history
Statistic  Value
Min.       0.000
1st Qu.    2.000
Median     2.000
Mean       2.545
3rd Qu.    4.000
Max.       4.000

Frequency table of history
Values  Frequency
0       40
1       49
2       530
3       88
4       293

 
 

However, presenting such a summary for every variable would be long and tedious. It is better to represent these numbers visually. A boxplot conveys the key values of numerical data, while a barplot gives a good overview of categorical data. Let’s look at the following graphs:

# Loop over every explanatory variable (the last column is the response)
for (i in 1:(length(cred) - 1)) {
  var_name <- colnames(cred)[i]
  # Variables whose values all stay below 5 are treated as categorical
  if (max(cred[, i]) < 5) {
    print(
      ggplot(cred, aes(x = .data[[var_name]])) +
        geom_bar(stat = "count", position = "dodge") +
        ggtitle(str_c("Barplot of\n", var_name)) +
        xlab(var_name) +
        ylab("Total") +
        my_theme()
    )
  } else {
    # Otherwise the variable is treated as numerical
    print(
      ggplot(cred, aes(y = .data[[var_name]])) + geom_boxplot() +
        ylab(var_name) +
        ggtitle(str_c("Boxplot of\n ", var_name)) +
        my_theme() +
        theme(
          axis.text.x=element_blank())
    )
  }
}

Thanks to these graphs, we can better understand our data at a glance and will be able to refer to them when needed.

In addition, these graphs enable us to see that some data are not tidy. For instance, education should be a binary variable. However, the barplot of this variable shows that a value of \(-1\) was recorded. We have the same problem for the binary variable guarantor, where a value of \(2\) is present.
We also strongly suspect that the variable age contains wrongly recorded data, as we can see an outlier with a value greater than 100.
We will have to confirm these first assumptions and correct the dirty data in an appropriate way.

Let’s first look at our variable age. We assume that, generally, a person will not live more than a hundred years and will not take out a credit at such an age. This is why observations with \(Age > 100\) are most likely wrongly recorded. We will therefore have to replace them in our database.
First, we have to find how many observations are potentially dirty according to our assumptions and locate them in order to replace them.

Number of instances with age > 100
age > 100  Freq
FALSE      999
TRUE       1

Position of instance with age > 100
537

 

According to our results, one observation with \(age > 100\) has to be replaced. It is instance 537 and its value is 125.

We can consider different options to replace this value. The first one could be to replace it with a value drawn at random within the range of the remaining data (between 19 and 75, the latter being the second-highest value after \(125\)).
However, according to the following histogram, the distribution of age (without the erroneous observation) is skewed, with a concentration around small values (which is plausible, as young people generally have less money than their elders and are therefore more likely to ask for credit).

It would therefore be possible to draw the replacement at random with probabilities proportional to the size of each age class.
We prefer to use the median (equal to 33) to replace our problematic value, as it is simpler.
Note that the problematic value should be excluded when calculating the median.

cred$age[cred$age > 100] <- median(cred$age[cred$age <= 100])

An alternative could have been to use the mean but, as no large outlier remains, both values would have been close to each other (\(mean = 35.596\) while \(median = 33\)).
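The replacement step can be illustrated on a small vector. This is a minimal sketch with made-up ages (the vector age_toy is hypothetical and the last entry plays the role of the wrongly recorded value); it computes the median on the valid entries only and uses it to overwrite the invalid one.

```r
# Hypothetical ages; the last value stands for the wrongly recorded entry
age_toy <- c(22, 25, 27, 31, 33, 35, 41, 49, 67, 125)

# Flag the entries we consider plausible
valid <- age_toy <= 100

# Median and mean computed WITHOUT the problematic value
median_valid <- median(age_toy[valid])
mean_valid <- mean(age_toy[valid])

# Replace the invalid entry by the median of the valid ones
age_clean <- age_toy
age_clean[!valid] <- median_valid

age_clean
```

Computing the median on the full vector instead would let the erroneous entry influence its own replacement, which is why the valid subset is used.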

 

Next, we also have to deal with our two categorical variables that contain a wrongly recorded value:
- one in education
- one in guarantor

They also have to be cleaned.

The following is again the barplot of education.

Here, the wrongly recorded value is clearly much more likely to be \(0\) than \(1\). Therefore, each of the previously presented methods (rounding the mean, using the median, or even assigning a class at random) would with high probability assign this instance the value \(Education = 0\).
We can confirm this first assumption with a frequency table:

Frequency table of education
             -1      0      1
sample size  1       950    49
proportion   0.1 %   95 %   4.9 %

 

It is indeed more appropriate to replace our value by 0, as this class is close to 20 times more frequent than class 1.

cred$education[cred$education == -1] <- 0

 

Concerning the variable guarantor, we can look at the frequency table and plot the barplot as well:

Frequency table of guarantor
             0        1       2
sample size  948      51      1
proportion   94.8 %   5.1 %   0.1 %

Again, for the same reasons, it is preferable to replace the wrongly recorded value by 0.

cred$guarantor[cred$guarantor == 2] <- 0

In addition, the variable present_resident is also problematic, as it does not have the same range as the other categorical variables. Its range goes from 1 to 4 whereas it should go from 0 to 3 like the others. We can shift its values in order to have the same format everywhere.

cred$present_resident <- cred$present_resident - 1

These first steps have enabled us to better understand our explanatory variables and to clean the problematic ones.
We now have to look in detail at the response variable, on which the predictions should be made.

Response variable

As our final goal is to predict whether a customer should be classified as risky or not, we have to take a particular look at our response variable, which establishes whether an applicant presents a good or a bad risk.
Let’s first have a look at its distribution:

Frequency table of our response variable
             0      1
sample size  300    700
proportion   30 %   70 %

As we can see, if a random customer steps into the bank, the a priori probability that they will present a good risk is 70%.
Without any calculation, the bank is therefore more likely than not to make a good decision when granting a credit.
However, the consequences can be dramatic if 30% of the credits that the bank grants are not fully reimbursed. That is why we have to develop a model that improves on this baseline accuracy, which is obtained with the naive rule of always granting a credit.
Ideally, this model should also minimise the number of credits that are predicted as “good” but are actually “bad”, as the consequences for the bank (the customer not reimbursing the credit) are much more serious in this situation than when a “good” credit is predicted as “bad”.
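One way to make this asymmetry concrete is to weight each type of error by a cost. The sketch below is purely illustrative: the counts in conf_mat and the 5-to-1 cost ratio in cost are assumptions chosen for the example, not values taken from the client or from our models.

```r
# Hypothetical confusion matrix (rows: observed class, columns: predicted class)
conf_mat <- matrix(c(60,  40,
                     20, 180),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(observed = c("bad", "good"),
                                   predicted = c("bad", "good")))

# Hypothetical costs: granting a "bad" credit costs 5 units,
# refusing a "good" credit costs 1 unit, correct decisions cost nothing
cost <- matrix(c(0, 5,
                 1, 0),
               nrow = 2, byrow = TRUE,
               dimnames = dimnames(conf_mat))

# Accuracy and total misclassification cost of the hypothetical model
model_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
model_cost <- sum(conf_mat * cost)

# Naive rule: every applicant is predicted "good"
naive_mat <- matrix(c(0, 100,
                      0, 200),
                    nrow = 2, byrow = TRUE,
                    dimnames = dimnames(conf_mat))
naive_accuracy <- sum(diag(naive_mat)) / sum(naive_mat)
naive_cost <- sum(naive_mat * cost)

c(model_accuracy = model_accuracy, naive_accuracy = naive_accuracy)
c(model_cost = model_cost, naive_cost = naive_cost)
```

Under these assumed costs, two rules with similar accuracies can have very different total costs, which is the reason accuracy alone is not sufficient for this problem.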

We will get to this later. First, after having presented each variable, it could be interesting to see if we can already have some assumptions concerning the relations between the explanatory variables and the response variable.

Interactions between the explanatory variables and the response variable

Text…

–> Explain that we can already form a first prior about which variables will have an impact

** BOXPLOTS ARE NOT GREAT FOR BINARY VARIABLES ??!!**

Modelling

Before beginning to work on our different models, we create a training set and a test set. This procedure is used in order to detect overfitting and to avoid evaluating a model on instances that were used to build it.
In order to evaluate the performance of the different models, we will use the exact same training and test sets for each model, to be sure that the performance differences result from the model we use and not from the randomness of splitting the data differently.
A cross-validation will be performed at the end in order to compare the different models and to choose which one should be used to make predictions.

# Creation of testing and training sets
set.seed(1234) 
index.train <- sample(1:nrow(cred), size = nrow(cred) * 0.7, replace = FALSE)
cred.train <- cred[index.train,]
cred.test <- cred[-index.train,]

In addition, to evaluate our models, we will use the accuracy as the main measure of performance. Since computing the accuracy by hand for each model would be tedious, we build a function that retrieves it from a confusion matrix:

# Function returning the accuracy from a confusion matrix:
# correctly classified instances (the diagonal) divided by the total
accuracy <- function(c){
  sum(diag(c))/sum(c)
}

We can now begin to work on the different models that we will use in order to make our predictions.

CART

First, we begin our analysis by using a decision (classification) tree.

The goal of a decision tree is to predict the final class of our response variable (“good” or “bad”) by applying a succession of binary rules to our data.
Each node is created by an algorithm that aims to minimize an impurity criterion. The feature and threshold value that maximise the impurity reduction (that “best split” the dataset in two) are selected. This procedure is repeated until a stopping rule is reached.
At the end, we have multiple branches that each lead to the final prediction for a given instance.
Better than words, let’s compute the model and have a look at it.
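To make the notion of impurity reduction concrete, here is a small worked example using the Gini index, which is the default splitting criterion of rpart for classification. The helper functions gini and gini_reduction and the toy vectors y_toy and x_toy are ours, introduced only for illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity reduction obtained by sending the rows flagged in `left`
# to the left child and the others to the right child
gini_reduction <- function(y, left) {
  n <- length(y)
  gini(y) -
    (sum(left) / n) * gini(y[left]) -
    (sum(!left) / n) * gini(y[!left])
}

# Toy data: 3 "bad" and 7 "good" credits, and one hypothetical binary feature
y_toy <- c("bad", "bad", "bad", "good", "good",
           "good", "good", "good", "good", "good")
x_toy <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

parent_impurity <- gini(y_toy)                  # 1 - (0.3^2 + 0.7^2) = 0.42
reduction <- gini_reduction(y_toy, x_toy == 0)  # 0.42 - 0.4 * 0.375 - 0.6 * 0 = 0.27

parent_impurity
reduction
```

The tree-growing algorithm evaluates this reduction for every candidate feature and threshold and keeps the split with the largest value.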

Building the model

cart.model <- rpart(response ~ .,  data = cred.train, method = "class")
rpart.plot(cart.model, main="Original decision tree")

–> At this point, we have obtained a decision tree that can be used to make predictions. At each node, one looks at the value of the corresponding variable and takes the appropriate branch until reaching a leaf, where the prediction is made.
However, this tree is also quite complex: there are a lot of splits and branches.
We have to tune this initial model and make it simpler. We will simplify it without losing predictive capability by pruning it, keeping only the most important splits, linked to the most important variables.

Tuning the model: Pruning

As already said, our complex tree has to be pruned in order to reduce its complexity. To do so, we will use the 1-SE rule.
The idea behind this rule is quite general. Just as one would use a t-test in statistics to see whether two measures are statistically different, the 1-SE rule tries to establish whether two models produce meaningfully different results (whether we can affirm that one outperforms the other).
We therefore consider the xerror (the cross-validated error, which is the criterion we want to minimize) and its standard error xstd. A model whose xerror falls within one standard error of the lowest xerror can be considered equivalent in terms of performance. Therefore, using this rule, we can prune the tree to get a much simpler model without losing quality, as the performance of the new model is not distinguishable from that of the best one.
To select the size of our pruned tree, we take the lowest xerror in the table and add its standard error xstd. We then select the simplest tree whose xerror lies below this threshold.

Let’s have a look at these values from our original tree and decide where to prune it.

CP table of the CART model
CP         nsplit  rel error  xerror     xstd
0.0572139  0       1.0000000  1.0000000  0.0595529
0.0398010  2       0.8855721  0.9900498  0.0593745
0.0298507  3       0.8457711  0.9950249  0.0594641
0.0199005  4       0.8159204  0.9452736  0.0585352
0.0149254  8       0.7114428  0.9303483  0.0582417
0.0124378  11      0.6666667  0.9651741  0.0589157
0.0100000  13      0.6417910  0.9651741  0.0589157

Variable importance table (first 6 variables)
Variable    Importance
chk_acct    30.415366
duration    24.535462
amount      16.384878
history     9.693259
sav_acct    9.544539
employment  6.126322

According to our first table, the minimal xerror is equal to 0.9303483, its associated xstd to 0.0582417, and the sum of both is therefore equal to 0.98859.

The smallest tree with an xerror below this value has an xerror of 0.9452736, and we will therefore prune the tree at CP = 0.0199005. According to our previous table, this corresponds to keeping 4 splits, i.e. a tree with 5 leaves.
It is also possible to read this off the graph, where the dashed line indicates the minimal xerror plus its xstd. We select the least complex tree under this line, which is the tree of size 5 (or, again, with 4 splits).
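The selection described above can be checked by hand. The sketch below re-enters the CP table shown earlier as a plain data frame (the names cp_table, threshold, chosen and cp_1se are ours, for illustration only) and picks the simplest tree whose xerror lies below the minimal xerror plus its xstd.

```r
# CP table of the CART model, copied from the output above
cp_table <- data.frame(
  CP     = c(0.0572139, 0.0398010, 0.0298507, 0.0199005,
             0.0149254, 0.0124378, 0.0100000),
  nsplit = c(0, 2, 3, 4, 8, 11, 13),
  xerror = c(1.0000000, 0.9900498, 0.9950249, 0.9452736,
             0.9303483, 0.9651741, 0.9651741),
  xstd   = c(0.0595529, 0.0593745, 0.0594641, 0.0585352,
             0.0582417, 0.0589157, 0.0589157)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# Threshold: minimal xerror plus its standard error
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Rows are ordered from the simplest to the most complex tree,
# so the first row below the threshold is the simplest acceptable tree
chosen <- min(which(cp_table$xerror < threshold))
cp_1se <- cp_table$CP[chosen]

cp_table[chosen, ]
```

The selected row is the one with 4 splits, and its CP value is the one to pass to prune.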

We thus prune the tree at this value (the R function requires the CP to be specified).

cart.pruned <- prune(cart.model, cp = cp.pruned)

We can visualize this new pruned tree, which is much smaller and therefore less complex than our original one. We will use it to make our predictions, build the confusion matrix and calculate the resulting accuracy.

Predicting the values of the testing set

pred.cart <- predict(cart.pruned, newdata = cred.test, type = "class")

# confusion matrix (rows: predictions, columns: observations):
cart.tab <- table(Predictions = pred.cart, Observations = cred.test$response) 
cart.tab %>% kable(caption = "Confusion matrix for the CART model",
                   col.names = c("Observed bad", "Observed good")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F,
    position = "l") %>%
  column_spec(1, border_right = T, width = "5em") %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "6em")
Confusion matrix for the CART model
              Observed bad  Observed good
Predict bad   27            12
Predict good  72            189

Based on this table, we can calculate the accuracy using our previously built function, which divides the number of correctly classified elements by the total number of elements.

# accuracy using our previously built function
accuracy(cart.tab) 
## [1] 0.72

It is possible to get more information, for instance the row and column totals, with the following cross table.


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
                   | pred.cart 
cred.test$response |       bad |      good | Row Total | 
-------------------|-----------|-----------|-----------|
               bad |        27 |        72 |        99 | 
                   |     0.273 |     0.727 |     0.330 | 
                   |     0.692 |     0.276 |           | 
                   |     0.090 |     0.240 |           | 
-------------------|-----------|-----------|-----------|
              good |        12 |       189 |       201 | 
                   |     0.060 |     0.940 |     0.670 | 
                   |     0.308 |     0.724 |           | 
                   |     0.040 |     0.630 |           | 
-------------------|-----------|-----------|-----------|
      Column Total |        39 |       261 |       300 | 
                   |     0.130 |     0.870 |           | 
-------------------|-----------|-----------|-----------|

 

We can see that 99 instances are actually bad and 201 are actually good, whereas the model predicts 39 as bad and 261 as good.
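Beyond the overall accuracy, the same counts give class-specific rates. The sketch below re-enters the CART confusion matrix shown above by hand (the object cart_counts and the rate names are ours, for illustration) and computes the share of bad credits that are detected and the share of good credits that are accepted.

```r
# CART confusion matrix from the output above
# (rows: predictions, columns: observations)
cart_counts <- matrix(c(27,  12,
                        72, 189),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Predictions = c("bad", "good"),
                                      Observations = c("bad", "good")))

# Overall accuracy: correctly classified / total
overall_accuracy <- sum(diag(cart_counts)) / sum(cart_counts)

# Share of actually bad credits that the model flags as bad
bad_detected <- cart_counts["bad", "bad"] / sum(cart_counts[, "bad"])

# Share of actually good credits that the model accepts as good
good_accepted <- cart_counts["good", "good"] / sum(cart_counts[, "good"])

# Among the credits predicted as good, share that are actually good
good_precision <- cart_counts["good", "good"] / sum(cart_counts["good", ])

round(c(accuracy = overall_accuracy,
        bad_detected = bad_detected,
        good_accepted = good_accepted,
        good_precision = good_precision), 3)
```

These rates match the row and column proportions of the cross table: only about 27% of the bad credits are detected, which matters more to the bank than the overall accuracy.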

–> Discuss the accuracy itself (0.72 is not great compared with the baseline of 0.7 obtained by always predicting “good”) –> Discuss the correct predictions of 0 (these are the ones we care about, and predicting only 1 would lead the bank to bankruptcy) –> Discuss FALSE POSITIVE / FALSE NEGATIVE / TRUE POSITIVE / TRUE NEGATIVE

Neural Network

Building the model and optimizing it: selecting the number of neurones in the hidden layer

The “literature” (https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw) suggests keeping a single hidden layer with a number of neurons between 1 and the number of dimensions: we will test them all and keep the best one… “In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.”

“Over-parametrization quickly lead to instability in the estimate and can lead to overfitting. One possibility is to impose a penalty on the largest weights during the optimization. This is called regularization and, more specifically in the context of neural network, weight decay.”

–> We try to find the “optimal” number of neurons. Since a neural network starts from random initial values, the results vary from run to run. We therefore compute the accuracy for each number of neurons 10 times and check whether there is an optimal number of neurons.

train_control <- trainControl(method = "cv", number = 5)

nnet_fit <- train(form = response~ .,
                 data = cred,
                 trControl = train_control,
                 tuneGrid = expand.grid(size = 14:25, decay = c(1.0, 1.5, 2)),
                 method = "nnet")

plot(nnet_fit)

nnet_fit$results
nnet_fit$results[which.max(nnet_fit$results$Accuracy),]$size # the size maximizing the accuracy
nnet_fit$results[which.max(nnet_fit$results$Accuracy),]$decay # the decay maximizing the accuracy

Model selection and predicting the values of the testing set

We present the characteristics of the retained model, with 15 neurons in the hidden layer and a decay of 1. As usual, we train the model and predict the values of the test set in order to build the confusion matrix and calculate the accuracy.

# Row of the tuning results with the highest cross-validated accuracy
best.nnet <- nnet_fit$results[which.max(nnet_fit$results$Accuracy), ]

# The formula is `response ~ .` rather than `cred.train$response ~ .`:
# with the latter, the dot would also pick up the response column as a predictor
nnet.model.retained <- 
  nnet(response ~ .,
       data = cred.train, maxit = 200,
       size = best.nnet$size, 
       decay = best.nnet$decay)

# Predictions on the test set:
pred.nnet.retained <- predict(nnet.model.retained, cred.test, type="class") 

# Confusion matrix (rows: observed classes, columns: predicted classes)
tab.nnet.retained <- table(Reality = cred.test$response, 
                           Predicted = pred.nnet.retained) 
# Accuracy: number of correct predictions divided by the size of the test set
acc.nnet.retained <- sum(cred.test$response == pred.nnet.retained, na.rm = TRUE) /
  length(cred.test$response) 

In addition, we can plot our neural network:

This graph shows that our model is quite complex: there are plenty of arrows. It is not necessary to understand each of them precisely; the network can be seen as a “black box”, meaning that the interpretability of such a model is limited. But, after all, what we need is to make good predictions!

Confusion matrix of the fitted NNET model
      Predict bad  Predict good
bad   53           46
good  30           171

We obtain a final accuracy of 0.747 on our testing set. Note that this accuracy differs from the cross-validated one obtained before, as here we use a single test set whose class proportions may be unbalanced. At the end, we will again use a cross-validation for all models on the same train and test sets in order to compare them.

Support Vector Machine (SVM)

Building the model

Optimizing the model: selecting good cost and gamma parameters

Predicting the values of the testing set

K - Nearest Neighbors

Building the model and Optimizing the model: selecting the number of neighbors and the distance measure

# Tuning grid for caret's "kknn" method (its tuning parameters are kmax, distance and kernel)
# kknn_parameters <- expand.grid(kmax = 2:30, distance = 1:5, kernel = "optimal")
# 
# kknn_fit <- train(form = response~ .,
#                  data = cred,
#                  trControl = train_control,
#                  tuneGrid = kknn_parameters,
#                  method = "kknn",
#                  preProcess = c("center", "scale"))
# plot(kknn_fit)
# kknn_fit$results
# 
# kknn_fit$results[which.max(kknn_fit$results$Accuracy),]$kmax
# kknn_fit$results[which.max(kknn_fit$results$Accuracy),]$distance

After having created multiple K-Nearest Neighbors models with the kknn method, we find that a distance of 2 outperforms the other distances. We can therefore use the knn method, which uses the Euclidean distance (the Minkowski distance of order 2), to illustrate how our final model is built and how we select k (note that knn is cheaper in terms of computing power, which is the only reason why we use it to choose k rather than kknn just above).

We can thus use knn to see how the accuracy varies for k between 2 and 30 and select the number of neighbors that leads to the highest accuracy.
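To show what the distance parameter and k mean in practice, here is a toy nearest-neighbour classifier written in base R. It is only a sketch for intuition, not the implementation used by caret: the functions minkowski and knn_predict and the toy points are ours.

```r
# Minkowski distance of order p between two numeric vectors
# (p = 1 is the Manhattan distance, p = 2 the Euclidean distance)
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}

# Predict the class of one new point by majority vote among its k nearest neighbours
knn_predict <- function(train_x, train_y, new_x, k = 3, p = 2) {
  d <- apply(train_x, 1, minkowski, b = new_x, p = p)  # distance to every training row
  nearest <- order(d)[seq_len(k)]                      # indices of the k closest rows
  votes <- table(train_y[nearest])                     # class counts among the neighbours
  names(votes)[which.max(votes)]
}

# Toy training set: two well-separated groups (made-up values)
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5),
                  ncol = 2, byrow = TRUE)
train_y <- c("bad", "bad", "bad", "good", "good", "good")

pred_near_origin <- knn_predict(train_x, train_y, c(0.5, 0.5), k = 3)
pred_far <- knn_predict(train_x, train_y, c(5.5, 5.5), k = 3)

c(pred_near_origin, pred_far)
```

Because the vote depends on raw distances, variables on large scales would dominate, which is why the caret call centers and scales the predictors first.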

knn_parameters <- expand.grid(k = 2:30)

knn_fit <- train(form = response~ .,
                 data = cred,
                 trControl = train_control,
                 tuneGrid = knn_parameters,
                 method = "knn",
                 preProcess = c("center", "scale"))
plot(knn_fit)

knn_fit$results
##     k Accuracy     Kappa AccuracySD    KappaSD
## 1   2    0.692 0.2462395 0.02683282 0.06888221
## 2   3    0.699 0.2441172 0.03286335 0.07808652
## 3   4    0.695 0.2196982 0.04138236 0.10429467
## 4   5    0.722 0.2639107 0.02387467 0.05968080
## 5   6    0.716 0.2446454 0.01816590 0.04808965
## 6   7    0.731 0.2744440 0.02534758 0.06783559
## 7   8    0.720 0.2452409 0.03221025 0.08911043
## 8   9    0.727 0.2605492 0.02413504 0.06231199
## 9  10    0.730 0.2655259 0.01457738 0.04736117
## 10 11    0.736 0.2709487 0.01294218 0.04068508
## 11 12    0.742 0.2884146 0.01483240 0.05555365
## 12 13    0.740 0.2704959 0.00500000 0.03427498
## 13 14    0.746 0.2892835 0.01083974 0.03449258
## 14 15    0.738 0.2597819 0.01680774 0.05662844
## 15 16    0.741 0.2658625 0.01140175 0.04409081
## 16 17    0.737 0.2526184 0.01151086 0.04869526
## 17 18    0.745 0.2791824 0.01541104 0.04902223
## 18 19    0.734 0.2302861 0.02133073 0.08055620
## 19 20    0.731 0.2263990 0.01635543 0.06719721
## 20 21    0.734 0.2228327 0.02434132 0.09057878
## 21 22    0.728 0.1997635 0.02079663 0.08504872
## 22 23    0.730 0.1999490 0.02423840 0.09447919
## 23 24    0.728 0.1968037 0.02109502 0.07736229
## 24 25    0.729 0.1979697 0.02247221 0.08991717
## 25 26    0.722 0.1761069 0.02252776 0.09150449
## 26 27    0.729 0.1963998 0.01917029 0.07845248
## 27 28    0.734 0.2076551 0.02724885 0.10442122
## 28 29    0.730 0.1871680 0.01369306 0.06472995
## 29 30    0.726 0.1735408 0.01596872 0.06421134
knn_fit$results[which.max(knn_fit$results$Accuracy),]$k
## [1] 14

The selected knn model uses a distance of 2 and is computed with the 14 nearest neighbors. The accuracy obtained by cross-validation is equal to 0.746.

Predicting the values of the testing set

Confusion matrix for the K-NN model
Predict bad Predict good
bad 27 12
good 72 189

Given our confusion matrix, the accuracy obtained on our initial testing set is equal to (27 + 189) / 300 = 0.72.

Cross Validation approach

–> Explain why we do a cross-validation –>

We create 10 different sets in order to perform the cross-validation. As there are 1000 instances, we make 10 test sets of size 100. They are stored in a list named test.list. At each step, the remaining part of the database is stored in the list train.list.

test.list <- list() # creates an empty list that will be the test sets
train.list <- list() # creates an empty list that will be the train sets
counter <- 0

# Creates the 10 test sets of size 100 (and the matching train sets of size 900)
for (i in 1:10){
  index <- counter + c(1:100) # the row numbers that will be in the test set
  test.list[[i]] <- cred[index, ] # the test set number i
  train.list[[i]] <- cred[-index, ] # the train set number i
  counter <- counter + 100
}

For example, test.list[[1]] is a data set containing the first 100 rows of our data.
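The loop above assigns rows to folds in their original order. If the rows happened to be sorted in some way, a common precaution is to shuffle the fold assignment first. The following is a minimal sketch of that idea, assuming 1000 rows and 10 folds; the names n_rows, n_folds, fold_id and test_idx are illustrative and not part of our pipeline.

```r
set.seed(1234)
n_rows <- 1000   # number of instances, as in our data
n_folds <- 10

# Assign each row to one of the 10 folds in random order, 100 rows per fold
fold_id <- sample(rep(1:n_folds, length.out = n_rows))

# Row indices of each test fold; the train rows of fold i are all the others
test_idx <- split(seq_len(n_rows), fold_id)
train_idx_1 <- setdiff(seq_len(n_rows), test_idx[[1]])

lengths(test_idx)
```

Each row appears in exactly one test fold, so the 10 accuracies are computed on disjoint sets that together cover the whole data.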

# test.list[[1]] %>% kable(caption="Our first test set named test.list[[1]]") %>%
#   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
#                 full_width = F) %>%
#   scroll_box(width = "100%", height = "200px")
# 
# 
# train.list[[1]] %>% kable(caption="Train set associated with test.list[[1]]") %>%
#   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
#                 full_width = F) %>%
#   scroll_box(width = "100%", height = "200px")

CART Model

acc.cart.cv <- numeric(10)
for (i in 1:10){
  cart.cv <- rpart(response~., data=train.list[[i]]) # original tree for fold i
  
  # 1-SE rule on the CP table of this fold's tree:
  # threshold = minimal xerror + its standard error
  cpt <- cart.cv$cptable
  best <- which.min(cpt[, "xerror"])
  threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]
  
  # CP of the simplest tree whose xerror lies below the threshold
  cp.pruned <- cpt[min(which(cpt[, "xerror"] < threshold)), "CP"]
  
  # the pruned tree for predictions (pruning the tree fitted on fold i)
  cart.pruned.cv <- prune(cart.cv, cp = cp.pruned) 
  
  # making the predictions
  cart.pred.cv <- predict(cart.pruned.cv, newdata=test.list[[i]], type="class") 
  
  # the confusion matrix
  tab.cart.cv <- table(test.list[[i]]$response, cart.pred.cv) 
  
  # the final accuracy
  acc.cart.cv[i] <- accuracy(tab.cart.cv) 
}
acc.cart.cv
##  [1] 0.75 0.68 0.77 0.72 0.72 0.59 0.70 0.68 0.71 0.68
mean(acc.cart.cv)
## [1] 0.7
sd(acc.cart.cv)
## [1] 0.04898979
# train_control <- trainControl(method = "cv", number = 10)
# 
# knn_parameters <- data.frame (k = 3:30)
# 
# knn_fit <- train(form = response~ .,
#                  data = cred,
#                  trControl = train_control,
#                  tuneGrid = knn_parameters,
#                  method = "knn",
#                  preProcess = c("center", "scale"))
# plot(knn_fit)
# knn_fit$results
train_control <- trainControl(method = "cv", number = 10)

nnet_fit <- train(form = response~ .,
                 data = cred,
                 trControl = train_control,
                 tuneGrid = expand.grid(size = 10:20, decay = 1:5),
                 method = "nnet")

plot(nnet_fit)
nnet_fit
nnet_fit$results